Received: by cheltenham.cs.arizona.edu; Mon, 30 Jan 1995 09:37:21 MST
To: icon-group-l@cs.arizona.edu
Date: Mon, 30 Jan 1995 15:36:49 GMT
From: goer@quads.uchicago.edu (Richard L. Goerwitz)
Message-Id: <1995Jan30.153649.11735@midway.uchicago.edu>
Organization: University of Chicago
Sender: icon-group-request@cs.arizona.edu
References: <3geb4t$cfn@magnum.convex.com>, <3gg0cv$jv8@hahn.informatik.hu-berlin.de>
Reply-To: goer@midway.uchicago.edu
Subject: Unicode (was Re: Linux, HPFS, and internationalization)
Errors-To: icon-group-errors@cs.arizona.edu
I'm cross-posting to the Icon newsgroup, because of the recent discussion
there of Unicode and internationalization issues. Followups, though, are
directed back to comp.os.linux.development.system:
In article loewis@informatik.hu-berlin.de (loewis) writes:
>
>In general, I believe internationalization efforts of Linux should
>introduce Unicode wherever reasonable - file names is one of the places.
>The question about sorting now is still: according to what rules. When
>displaying information to the user, you would like to follow the sorting
>rules of the user's native language. In the file system, all that counts
>is that you never lose accessibility, as you point out. These are two
>different things, though.
I think that this was IBM's idea when tagging directory entries for
code page. They wanted different sort orders for different "locales"
(in this case identified with codepages). Just a hunch.
One of the problems with Unicode, incidentally, is that despite all the
hoopla, information about it is being disseminated very, very slowly.
And it is doubtful that Unicode will ever displace standards like
Shift-JIS in Asia. Also, note that if localization is a concern (as in
the above posting), Unicode isn't a cure-all. Unicode is kind of a
super-ISO 8859-1 in the sense that it doesn't tell you what language or
locale you're in. So, for example, if I run into an Arabic alif in
Unicode, I really don't know whether I'm looking at Arabic, Persian, or
Urdu. The problem is the same for the so-called "CJK" languages:
Chinese, Japanese, and Korean.
This would be a good time for someone who's worked on Plan 9 to jump
in with advice. What would be a sensible way of migrating to Unicode
or other standards? Do we use UTF-8 (about which information is even
harder to come by than it is for Unicode)? Or do we use some form of wide-char
I/O, using straight Unicode? Or do we default to UTF-8 for backwards
compatibility, but provide facilities for straight Unicode?
As usual, I must confess that I'm not a software engineer. I'm in Near
Eastern Languages. But I'm following this group because the Linux
community seems unusually responsive to internationalization issues, at
least on the discussion level. (Apps. don't seem to be moving along
in this direction.) Part of the problem is that information isn't all
that widely disseminated (at least in the US) about how other scripts
and encoding systems work. Programmers just don't have the basic info
they need. And few Americans understand how Asian or Middle Eastern
or Indian scripts work (the bidirectional wordwrap algorithm, for
example, baffles many of them - not because they're dumb, but because
of simple lack of exposure to Arabic, Hebrew, etc.).
As a first step, everyone ought to peruse the
comp.software.internationalization FAQ, which at least sets forth how
to use ANSI C setlocale.
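To illustrate what setlocale buys you, here is a minimal sketch in Python, whose locale module is a thin wrapper around the very ANSI C setlocale()/strxfrm() interface the FAQ describes. The word list is made up, and the portable "C" locale is used so the behavior doesn't depend on your environment:

```python
import locale

# locale.setlocale() wraps the ANSI C call of the same name. Passing ""
# instead of "C" would adopt the user's environment (LANG, LC_ALL, ...);
# "C" is the portable default that every implementation must provide.
locale.setlocale(locale.LC_ALL, "C")

words = ["cherry", "Banana", "apple"]

# Raw byte (codepoint) order: all uppercase sorts before all lowercase
# in ASCII, which is rarely what a user wants to see.
print(sorted(words))

# Collation order of the active locale, via strxfrm(). In the "C"
# locale this coincides with byte order; in a national locale it would
# follow that language's dictionary rules instead.
print(sorted(words, key=locale.strxfrm))
```

Under a real national locale (e.g. "de_DE", where available), the second sort would interleave upper- and lowercase the way a dictionary does - which is exactly the display-time/storage-time split discussed above.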
A second step is to buy the Unicode manual, now out of date (but still
useful enough):
The Unicode Standard, Version 1.0 (two volumes), Addison-Wesley, 1990/91.
At the end of that volume is a horrendous account of the bidi wordwrap
algorithm that is guaranteed to mystify anyone who has not studied
Arabic and/or Hebrew. If anyone gets this far, and wants clean, simple
info, please contact me directly. I'll post my own informal,
programmer-oriented description of the bidi algorithm if asked.
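In the meantime, the core idea can be caricatured in a few lines. This is a deliberate oversimplification, not the real algorithm (which also handles neutrals such as the spaces *between* right-to-left words, digits, nesting levels, and mirrored punctuation). Here uppercase letters stand in for RTL script, a common convention in bidi test data:

```python
# Caricature of bidi display: text is stored in logical (reading)
# order, and each maximal run of RTL characters is reversed for
# display. Uppercase = pretend Hebrew/Arabic; everything else is LTR.

def display_order(logical):
    """Reverse each maximal run of 'RTL' (uppercase) characters."""
    out, run = [], []
    for ch in logical:
        if ch.isupper():          # stand-in for an RTL character
            run.append(ch)
        else:
            out.extend(reversed(run))   # flush the pending RTL run
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)

print(display_order("he said ABC DEF to me"))
# -> "he said CBA FED to me"
```

Note that the real algorithm would render the two RTL words as "FED CBA" (the space between them belongs to the RTL run); handling such neutrals is exactly where the full specification gets hairy.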
For UTF-8, check out http://www.stonehand.com/unicode/standard/utf8.html.
Basically, it works as follows:
Bits  Hex Min   Hex Max   Byte Sequence in Binary
  7   00000000  0000007F  0vvvvvvv
 11   00000080  000007FF  110vvvvv 10vvvvvv
 16   00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
 21   00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
 26   00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
 31   04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

The UCS value is just the concatenation of the v bits in the multibyte
encoding. When there are multiple ways to encode a value, for example
UCS 0, only the shortest encoding is legal.
The idea is that no UTF-8 sequence can be confused with ASCII codes, since
the first 128 places (if I understand correctly) constitute a compatibility
zone. Note that some space is lost in storage, but compression takes care
of this nicely. Internally, one can use whatever one pleases (16-bit
chars are sufficient for Unicode).
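To make the table concrete, here is a sketch of an encoder for the 16-bit range discussed above (Python is used for brevity; the byte patterns come straight from the table):

```python
# Sketch of UTF-8 encoding for UCS values up to 0xFFFF, following the
# table above: 0vvvvvvv / 110vvvvv 10vvvvvv / 1110vvvv 10vvvvvv 10vvvvvv.

def utf8_encode(cp):
    if cp < 0x80:                       # 7 bits: ASCII passes through
        return bytes([cp])
    elif cp < 0x800:                    # 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp < 0x10000:                  # 16 bits: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond the 16-bit range shown here")

# ASCII is encoded as itself, so an ASCII file is already valid UTF-8:
assert utf8_encode(0x41) == b"A"
# The Arabic alif, U+0627, becomes the two-byte sequence D8 A7; both
# bytes have the high bit set, so neither can be mistaken for ASCII:
assert utf8_encode(0x0627) == b"\xd8\xa7"
```

Decoding is the mirror image: the leading byte's high bits announce the sequence length, and the v bits are concatenated back into the UCS value, rejecting any non-shortest form.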
I hope this helps. The Linux community is an interesting, energetic
bunch.
--
Richard L. Goerwitz *** goer@midway.uchicago.edu